Draft WebCorp: providing a renewable data source for corpus linguists

Author

  • Antoinette Renouf
Abstract

The many electronic text corpora available nowadays present ever fewer obstacles to a wide range of corpus linguistic study. However, corpora are expensive resources to create and to update, and there remain problems for linguists if they seek access to very large, very recent, or changing language. The World Wide Web, whilst intended as an information source, is an obvious resource for the retrieval of linguistic information, being the largest store of texts in existence, freely available, covering a range of domains, and constantly added to and updated. Individual linguistic researchers have been trying to retrieve instances of rare or neologistic language use from the web by manipulating existing web search engines. Whilst this strategy is possible, in particular via Google, the output is rather haphazard and not linguist-friendly. The Research and Development Unit for English Studies has been seeking to remedy the situation through the creation of ‘WebCorp’, a tool designed to search the Internet and provide linguists with tailored on-line access. A demonstration tool is available at http://www.webcorp.org.uk. This paper will report on the research initiative and highlight some of the issues involved.
A previously unimaginable number and range of electronic text corpora are now available to corpus linguists, from small and sampled collections to very large textual databases. Whilst this wealth of data makes possible many types of corpus-based research, particularly in the formerly rather inaccessible areas of lexis and lexico-grammar, it has inherent limitations. In practical terms, the corpus data and software may not be available without the appropriate computer access, licences, and so on. More fundamental linguistic limitations relate to the size, age and static nature of the corpora, which can preclude certain kinds of empirical linguistic investigation, for instance the study of very rare, new or changing language features.
An alternative source of linguistic information is the web, a publicly available data resource containing a vast and evolving accumulation of texts. Admittedly, this is not constructed or managed with the rigour or for the purposes of a corpus. It is a muddle of multilinguality; it operates a loose definition of 'text' which includes all manner of extraneous matter; text dating is sporadic and linguistically uninterpretable, so that neither the latest coinages nor the elements of language change across time that are undeniably in there are traceable by

Similar resources

Towards standards for corpus query: Work on a Lingua Franca for corpus query

In this presentation, we report on the ongoing work on the development of a standard for corpus query languages. This work takes place in the context of the ISO TC37/SC4 WG6 activity on the suggested work item proposal “Corpus Query Lingua Franca” (Bański and Witt, 2011). We have collected a set of requirements on a corpus query language motivated by the needs of linguists and we will presen...

UNCORRECTED DRAFT. For the final version, see Automated Building of

For most languages, including Polish, big error corpora are lacking. Traditional error corpora are collected and annotated by linguists, and the process is manual or only slightly automated. The task is therefore tedious and costly, and the results represent linguists’ knowledge about correct usage. This requires additional work to avoid theory-laden distortion of data. In this paper, I will sh...

Gearing the Discursive Practice to the Evolution of Discipline: Diachronic Corpus Analysis of Stance Markers in Research Articles’ Methodology Section

Despite widespread interest and research among applied linguists to explore metadiscourse use, very little is known of how metadiscourse resources have evolved over time in response to the historically developing practices of academic communities. Motivated by such an ambition, the current research drew on a corpus of 874315 words taken from three leading journals of applied linguistics in orde...

A WaCky Introduction

We use the Web today for a myriad purposes, from buying a plane ticket to browsing an ancient manuscript, from looking up a recipe to watching a TV program. And more. Besides these “proper” uses, there are also less obvious, more indirect ways of exploiting the potential of the Web. For language researchers, the Web is also an enormous collection of (mainly) textual materials which make it poss...

Publication date: 2009